The use of image transformations is essential for efficient modeling and learning of 
visual data. But the class of relevant transformations is large: affine transformations, 
projective transformations, elastic deformations, ... the list goes on. Therefore, 
learning these transformations, rather than hand coding them, is of great conceptual 
interest. To the best of our knowledge, all the related work so far has been concerned 
with either supervised or weakly supervised learning (from correlated sequences, 
video streams, or image-transform pairs). In this paper, on the contrary, we present 
a simple method for learning affine and elastic transformations when no examples of 
these transformations are explicitly given, and no prior knowledge of space (such as 
ordering of pixels) is included either. The system has only access to a moderately 
large database of natural images arranged in no particular order. 



I. INTRODUCTION 



Biological vision remains largely unmatched by artificial visual systems across a wide 
range of tasks. Among its most remarkable capabilities are the aptitude for unsupervised 
learning and efficient use of spatial transformations. Indeed, the brain's proficiency in various
visual tasks seems to indicate that some complex internal representations are utilized to
model visual data. Even though the nature of those representations is far from understood,
it is often presumed that learning them in an unsupervised manner is central to biological
neural processing [2] or, at the very least, highly relevant for modeling neural processing
computationally [9, 22]. Likewise, it is poorly understood how the brain implements
various transformations in its processing. Yet it must be clear that the level of learning 
efficiency demonstrated by humans and other biological systems can only be achieved by 
means of transformation-invariant learning. This follows, for example, from an observation 
that people can learn to recognize objects fairly well from only a small number of views. 

Covering both topics (unsupervised learning and image transformations) at once, by way 
of learning transformations without supervision, appears interesting to us for two reasons. 
Firstly, it can potentially further our understanding of unsupervised learning: what can 
be learned, how it can be learned, what are its strengths and limitations. Secondly, the 
class of transformations important for representing visual data may be too large for manual 
construction. In addition to transformations describable by a few parameters, such as affine, 
the transformations requiring infinitely many parameters, such as elastic, are deemed to be 
important [1]. Transformations need not be limited to spatial coordinates; they can involve
the temporal dimension or color space. Transformations can be discontinuous, can be composed
of simpler transformations, or can be non-invertible. All these cases are likely to be required 
for efficient representation of, say, an animal or person. Unsupervised learning opens the 
possibility of capturing such diversity. 

A number of works have been devoted to learning image transformations [11-13], [16-18].
Other works were aimed at learning perceptual invariance with respect to the transformations
[5, 19, 21], but without explicitly extracting them. Often, no knowledge of space
structure was assumed (such methods are invariant with respect to random pixel permutations),
and in some cases the learning was termed unsupervised. In this paper we adopt a
more stringent notion of unsupervised learning, by requiring that no ordering of an image 
dataset be provided. In contrast, the authors of the cited references considered some sort 
of temporal ordering: either sequential (synthetic sequences or video streams) or pairwise 
(grouping original and transformed images). Obviously, a learning algorithm can greatly 
benefit from temporal ordering; just like ordering of pixels opens the problem to a host of 
otherwise unsuitable strategies. Ordering of images provides explicit examples of transformations.
Without ordering, no explicit examples are given. It is in this sense that we talk
about learning without (explicit) examples. 

The main goal of this paper is to demonstrate learning of affine and elastic transformations 
from a set of natural images by a rather simple procedure. Inference is done on a moderately
large set of random images, and not just on a small set of strongly correlated images. The 
latter case is a (simpler) special case of our more general problem setting. 

The possibility of inferring even simple transformations from an unordered dataset of 
images seems intriguing in itself. Yet, we think that dispensing with temporal order has 
a wider significance. Temporal proximity of visual percepts can be very helpful for learning
some transformations but not others. Even the case of 3D rotations will likely require
generation of hidden parameters encoding higher level information, such as shape and orientation.
That will likely require processing a large number of images off-line, in a batch
mode, incompatible with temporal proximity. 

The paper is organized as follows. A brief overview of related approaches is given in
section II. Our method is introduced in section III. In section IV the method is tested on
synthetic and natural sets of random images. In section V we conclude with a discussion of
limitations and possible extensions of the current approach, outlining a potential application
to the learning of 3D transformations.

II. RELATED WORK 

It is recognized that transformation invariant learning, and hence transformations themselves,
possess great potential for artificial cognition. Numerous systems, attempting to
realize this potential, have been proposed over the last few decades. In most cases the
transformation invariant capabilities were built-in. In the context of neural networks, for example,
translational invariance can be built-in by constraining weights of connections [HI HO] . Some 
researchers used natural image statistics to infer the underlying structure of space without 
inferring transformations. For example, ideas of redundancy reduction applied to natural
images, such as independent component analysis or sparse features, lead to unsupervised
learning of localized retinal receptive fields [1] and localized oriented features, both in spatial
[15] and spatio-temporal [20] domains. 

As we said, transformation (or transformation-invariant) learning has so far been implemented
by taking advantage of temporal correlation in images. In Refs. [5, 19, 21]
transformation-invariant learning was achieved by incorporating delayed response to stimuli
into Hebbian-like learning rules.

By explicitly parametrizing affine transformations with continuous variables it was possible
to learn them first to linear order in Taylor expansion [16] and then non-perturbatively
as a Lie group representation [13, 17, 18]. In the context of energy-based models, such as
Boltzmann machines, transformations can be implemented by means of three-way interactions
between stochastic units. The transformations are inferred by learning interaction
strengths [11, 12]. In all these cases the corresponding algorithms are fed with training
examples (of possibly several unlabeled types) of transformations. Typically, images do not
exceed 40 x 40 pixels in size.

Below we demonstrate that image transformations can be learned without supervision,
and without temporal ordering of training images. We consider both synthetic and natural
binary images, achieving a slightly better result for the synthetic set. Transformations are
modeled as pixel permutations in 64 x 64 images. We see many possible modifications to 
our algorithm enabling more flexible transformation representation, more efficient learning, 
larger image sizes, etc. These ideas are left for future exploration. In the current manuscript, 
our main focus is on showing the feasibility of the proposed strategy in its basic incarnation. 

III. LEARNING TRANSFORMATIONS FROM UNORDERED IMAGES 

The basic idea behind our algorithm is extremely simple. Consider a pair of images 
and a transformation function. Introduce an objective function characterizing how well 
the transformation describes the pair, treating it as an image-transform pair. Minimize 
the value of the objective function across a subset of pairs by modifying the subset and 
the transformation incrementally and iteratively. The subset is modified by finding better-matching
pairs in the original set of images, using fast approximate search. We found that a
simple hill climbing technique was sufficient for learning transformations in relatively large
64 x 64 images. Below we describe the algorithm in more detail.

A. Close match search 

Let $S$ be a set of binary images of size $L \times L$. We sometimes refer to $S$ as the set of
random images. The images are random in the sense that they are drawn at random from
a much larger set $\mathcal{N}$, embodying some aspects of natural image statistics. For example,
$\mathcal{N}$ could be composed of: a) images of a white triangle on black background with integer-valued
vertex coordinates ($|\mathcal{N}| = L^6/3!$ images), b) $L \times L$ patches of (binarized) images
from the Caltech-256 dataset [8]. We will consider both cases. Notice that our definition of
$S$ implies that it needs to be sufficiently large to contain pairs of images connectable by a
transformation of interest. Otherwise such a transformation cannot be learned.
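As a rough check of the size of the triangle set, one can evaluate the count for the image size used later in the paper. The estimate below assumes that a triangle is identified with an unordered triple of vertices on the $L \times L$ integer grid and ignores degenerate and repeated-vertex cases, so it is an order-of-magnitude figure rather than an exact count.

```latex
|\mathcal{N}| \approx \frac{(L^2)^3}{3!} = \frac{L^6}{6},
\qquad
L = 64: \quad \frac{2^{36}}{6} \approx 1.1 \times 10^{10}.
```

Under this estimate, any set $S$ of practical size samples only a very small fraction of $\mathcal{N}$.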

To learn a transformation at $L = 64$ we will need $|S|$ to be on the order of 10^ - 10^,
with the number of close match searches on the order of 10^ - 10^. Clearly, it is crucial to
employ some efficient search technique. In a wide class of problems a significant speedup can 
be achieved by abandoning exact nearest neighbor search in favor of approximate nearest
neighbor search, with little loss in quality of performance. Our problem appears to belong to
this class. Therefore, approximate algorithms, such as best bin first [3] or locality sensitive
hashing (LSH) [7], are potential methods of choice. LSH seems especially suitable thanks 
to its ability to deal with high-dimensional data, like vectors of raw pixels. On the flip side, 
LSH requires estimation of optimal parameters, which is typically done with an implicit 
assumption that the query point is drawn from the same distribution as the data points. 
Not only is that not the case here, the query distribution itself changes in the course of 
the algorithm run. Indeed, in our case the query is the image transform under the current 
estimate of the transformation. It gradually evolves from a random permutation to something
approximating a continuous 2D transformation. To avoid these complications we opt
for storing images in binary search trees, while also creating multiple replicas of the tree to 
enhance performance in the spirit of LSH. Details are given below, but first we introduce a 
few notations. 

Let an $L \times L$ binary image be represented by a binary string $x = x_1 \ldots x_{L^2}$, where $x_i$ encodes
the color (0 = black, 1 = white) of the pixel in the $i$-th position (under some reference ordering
of pixels). Let $o$ be an ordering of pixels defined as a permutation relative to the reference
ordering. Given $o$, the image is represented by the string $x(o) = x_{o(1)} \ldots x_{o(L^2)}$. We will refer
to an image and its string representation interchangeably, writing $x_I(o)$ to denote an image
$I$.

Let $B(o)$ be a binary search tree that stores images $I \in S$ according to the (lexicographic)
order of $x_I(o)$. Rather than storing one image per leaf, we allow a leaf to contain up to $m$
images (that is, any subtree containing up to $m$ images is replaced by a leaf). We construct $l$
versions of the tree data structure, each replica with a random choice of $o_i$, $i = 1, \ldots, l$. This
replication is intended to reduce the possibility of a good match being missed. A miss may 
happen if a mismatching symbol (between a query string and a stored string) occurs too soon 
when the tree is searched. Alternatively, one could use a version of A* search, as in the best 
bin first algorithm, tolerating mismatched symbols. However, our empirical results suggest 
that the approach benefits from tree replication, possibly because information arrives from 
a larger number of pixels in this case. 

To find a close match to an image $I$, we search every binary tree $B(o_i)$ in the usual way,
using $x_I(o_i)$ as the query string. The search stops at a node $n$ if: a) $n$ is a leaf node, or b) the search
cannot proceed further ($n$ lacks an appropriate branch). All the images from the subtree
rooted at $n$ are returned. In the final step we compute the distance to every returned candidate
and select the image closest to $I$.

In short, in our close match search algorithm we use multiple binary search trees, with
distinct trees storing images in distinct random orderings of pixels. The described approximate
search yields a speedup of $|S|/(ml)$ over the exact nearest neighbor search. For the
values $m = 5$ and $l = 10$ that we used in our experiments (see section IV), the speedup was
about 10^ - 10^.
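The close match search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: each tree is realized as a binary trie over the permuted bit string, and all names (`Node`, `build`, `descend`, `close_match`, `build_forest`) are hypothetical. Images are assumed to be rows of a 0/1 NumPy array.

```python
import numpy as np

class Node:
    """Trie node: an inner node with 0/1 children, or a leaf holding image indices."""
    def __init__(self):
        self.children = {}   # bit value -> child Node
        self.items = None    # list of image indices when this node is a leaf

def build(images, idx, order, depth, m):
    """Store images[idx] by the bits read in the pixel ordering `order`.
    Recursive for brevity; identical images can make the recursion deep."""
    node = Node()
    if len(idx) <= m or depth == len(order):
        node.items = list(idx)          # subtree with at most m images becomes a leaf
        return node
    bits = images[idx, order[depth]]
    for b in (0, 1):
        sub = idx[bits == b]
        if len(sub) > 0:
            node.children[b] = build(images, sub, order, depth + 1, m)
    return node

def collect(node):
    """Return all image indices stored in the subtree rooted at `node`."""
    if node.items is not None:
        return list(node.items)
    out = []
    for child in node.children.values():
        out.extend(collect(child))
    return out

def descend(root, query, order):
    """Follow the query bits until a leaf is reached or a branch is missing."""
    node, depth = root, 0
    while node.items is None:
        child = node.children.get(int(query[order[depth]]))
        if child is None:
            break
        node, depth = child, depth + 1
    return collect(node)

def build_forest(images, n_trees, m, rng):
    """Build n_trees replicas, each with its own random pixel ordering."""
    idx = np.arange(len(images))
    forest = []
    for _ in range(n_trees):
        order = rng.permutation(images.shape[1])
        forest.append((order, build(images, idx, order, 0, m)))
    return forest

def close_match(images, forest, query):
    """Return (index, Hamming distance) of the best candidate over all trees."""
    cand = set()
    for order, root in forest:
        cand.update(descend(root, query, order))
    cand = np.fromiter(cand, dtype=np.int64)
    dist = np.count_nonzero(images[cand] != query, axis=1)
    best = int(np.argmin(dist))
    return int(cand[best]), int(dist[best])
```

A query that is itself stored in the set always reaches the leaf that contains it, so it is returned at distance zero. For other queries the result is only approximate, which is the trade-off accepted in the text.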



B. Transformation optimization 

We define an image transformation $T$ as a permutation of pixels $t$. That is, $Tx = x(t)$. Despite
the obvious limitations of this representation for describing geometric transformations, we
will demonstrate its capacity for capturing the essence of affine and elastic transformations.
To be precise, our method in the current formulation can only capture a volume-preserving
subset of these transformations. But removing this limitation should not be too difficult
(see section V for some discussion).

We denote a pair of images as $(I, I')$ or $(x_I, x_{I'})$. The Hamming distance between strings
$x$ and $x'$ is defined as $d(x, x') = \sum_i (x_i - x'_i)^2$. The objective function $d_T$, describing how
well a pair of images is connected by the transformation $T$, is defined as:

$$d_T(I, I') = d(T x_I, x_{I'}). \tag{1}$$

Thus, the objective function uses the Hamming distance to measure dissimilarity between
the second image and the transform of the first image.

We will be minimizing $d_T$ across a set of pairs, which we call the pair set and denote by
$\mathcal{P}$. The objective function $D_T$ over $\mathcal{P}$ is defined as:

$$D_T = \sum_{p \in \mathcal{P}} d_T(p), \tag{2}$$

where we used a shorthand notation $p$ for a pair from the pair set. We refer to the minimization
of $D_T$ while $\mathcal{P}$ is fixed and $T$ changes as the transformation optimization. We refer
to the minimization of $D_T$ while $T$ is fixed and $\mathcal{P}$ changes as the pair set optimization.
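The two objective functions can be written down directly. The sketch below assumes that images are flattened 0/1 NumPy vectors and that the permutation $t$ is an integer index array, so that $(Tx)_i = x_{t(i)}$; the function names are hypothetical.

```python
import numpy as np

def hamming(x, y):
    """Hamming distance between two binary strings."""
    return int(np.count_nonzero(x != y))

def apply_T(x, t):
    """Apply the pixel permutation t to image x: (Tx)_i = x_{t(i)}."""
    return x[t]

def d_T(t, x_first, x_second):
    """Pair objective: distance between the transform of the first image and the second."""
    return hamming(apply_T(x_first, t), x_second)

def D_T(t, first, second):
    """Pair set objective: sum of d_T over all pairs.
    first, second: arrays of shape (n_pairs, n_pixels), one pair per row."""
    return int(np.count_nonzero(first[:, t] != second))
```

If every second image is exactly the transform of its first image, `D_T` evaluates to zero.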

In the transformation optimization phase the algorithm attempts to minimize $D_T$ by
incrementally modifying $T$. A simple hill climbing is employed: a pair of elements from $t$ is
exchanged at random, and the modification is accepted if $D_T$ does not increase. A transformation
modification affects every transformed string $x(t)$ from $\mathcal{P}$. However, there is an economical
way of storing the pair set that makes the computation of $D_T$ particularly fast. Consider the
first images of all pairs from $\mathcal{P}$. Consider a matrix whose rows are these images. Let $\mathbf{x}_i$
be the $i$-th column in this matrix. Define similarly $\mathbf{x}'_i$, by arranging the second images in the
same order as their pair counterparts. The objective function expressed through these vector
notations then reads:

$$D_T = \sum_{i=1}^{L^2} d(\mathbf{x}_{t(i)}, \mathbf{x}'_i). \tag{3}$$

If the $i$-th and $j$-th elements of $t$ are exchanged, then the corresponding change $(\Delta D_T)_{ij}$ in
the objective function reads:

$$(\Delta D_T)_{ij} = 2\left(\mathbf{x}_{t(i)}^T \mathbf{x}'_i + \mathbf{x}_{t(j)}^T \mathbf{x}'_j - \mathbf{x}_{t(i)}^T \mathbf{x}'_j - \mathbf{x}_{t(j)}^T \mathbf{x}'_i\right), \tag{4}$$

which involves computing four terms of the form $\mathbf{x}_k^T \mathbf{x}'_{k'}$. For a binary image, as in our
case, the vectors are binary strings and their dot-products can be computed efficiently using
bitwise operations. Notice also that the vectors in Eq. (3) are unchanged throughout the
transformation optimization phase; only $t$ is updated. A transformation optimization phase
followed by a pair set optimization phase constitutes one iteration of the algorithm. There
are $n_t$ attempted transformation modifications per iteration.
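The incremental update can be illustrated with a short Python sketch. Expanding the Hamming distance for binary column vectors shows that exchanging the $i$-th and $j$-th elements of $t$ changes the objective by $2(\mathbf{x}_{t(i)} - \mathbf{x}_{t(j)})^T(\mathbf{x}'_i - \mathbf{x}'_j)$, which is a combination of four column dot products. The code assumes signed integer 0/1 matrices `X` and `Xp` (first and second images of the pairs, one pair per row) and uses ordinary integer dot products instead of bitwise operations for clarity; all names are hypothetical.

```python
import numpy as np

def total_D(t, X, Xp):
    """Objective over the pair set: sum over pixels of column Hamming distances."""
    return int(np.count_nonzero(X[:, t] != Xp))

def delta_D(t, i, j, X, Xp):
    """Change of the objective if the i-th and j-th elements of t were exchanged.
    X and Xp must have a signed integer dtype so that differences can be negative."""
    a = X[:, t[i]] - X[:, t[j]]
    b = Xp[:, i] - Xp[:, j]
    return 2 * int(a @ b)

def hill_climb(t, X, Xp, n_t, rng):
    """Attempt n_t random exchanges, keeping those that do not increase the objective."""
    t = t.copy()
    n_pix = len(t)
    for _ in range(n_t):
        i, j = rng.integers(n_pix, size=2)
        if delta_D(t, i, j, X, Xp) <= 0:
            t[i], t[j] = t[j], t[i]
    return t
```

Because `X` and `Xp` stay fixed during this phase, each attempted exchange costs only a few column operations, independent of the number of pixels.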



C. Pair set optimization

The goal of the pair set optimization is twofold. On the one hand, we want $\mathcal{P}$ to contain pairs
that minimize $D_T$. On the other hand, we would like to reduce the possibility of getting
stuck at a local minimum of $D_T$. To achieve the first goal, we update $\mathcal{P}$ by adding new
pairs, ranking all pairs according to $d_T$ and removing the pairs with the highest $d_T$. To add a
new pair $(I, I')$ we pick image $I$ at random from $S$, then search for $I'$ as a close match to
$T x_I$. To achieve the second goal, we add stochastic noise to the process by throwing out
random pairs from the pair set.

We denote by $n_n$ and $n_r$ the number of newly added pairs and the number of randomly
dropped pairs respectively, both per iteration. After $n_n$ pairs are added and $n_r$ pairs
are dropped, we remove $n_n - n_r$ pairs with the highest $d_T$, so that the number of pairs $|\mathcal{P}|$ in
the pair set remains unchanged.

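One pair set update, as described above, might look as follows in Python. This is a sketch under assumptions: the pair set is stored as an integer array of index pairs into $S$, the close match search is replaced by a brute-force nearest neighbor stand-in (`nearest`), and the counts of added and randomly dropped pairs are passed as `n_new` and `n_rand` with `n_new >= n_rand`. All names are hypothetical.

```python
import numpy as np

def pair_costs(S, t, pairs):
    """d_T for every pair; pairs has shape (n_pairs, 2) with indices into S."""
    return np.count_nonzero(S[pairs[:, 0]][:, t] != S[pairs[:, 1]], axis=1)

def nearest(S, query):
    """Brute-force stand-in for the approximate close match search."""
    return int(np.argmin(np.count_nonzero(S != query, axis=1)))

def update_pairs(S, t, pairs, n_new, n_rand, rng):
    """Add n_new pairs, drop n_rand random pairs, remove the n_new - n_rand worst."""
    # Add: pick a first image at random, find a close match to its transform.
    firsts = rng.integers(len(S), size=n_new)
    new = np.array([(i, nearest(S, S[i][t])) for i in firsts],
                   dtype=np.int64).reshape(-1, 2)
    pairs = np.vstack([pairs, new])
    # Drop: throw out n_rand pairs chosen at random (stochastic noise).
    keep = rng.permutation(len(pairs))[n_rand:]
    pairs = pairs[keep]
    # Remove: discard the pairs with the highest d_T so that the size is restored.
    n_remove = n_new - n_rand
    order = np.argsort(pair_costs(S, t, pairs), kind="stable")
    return pairs[order[:len(pairs) - n_remove]]
```

The returned pair set has the same number of pairs as the input, ordered by increasing $d_T$.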
D. Summary of the algorithm 



We briefly summarize our algorithm in Alg. III.1. First, $T$ and $\mathcal{P}$ are randomly initialized,
then the procedure MINIMIZED($T$, $\mathcal{P}$) is called. It stochastically minimizes $D_T$ by alternating
between transformation optimization and pair set optimization for a total of $n_i$ iterations.



Algorithm III.1: MINIMIZED(T, P)

for 1 to n_i
  do
    for 1 to n_t
      do
        (i, j) <- (RAND(L^2), RAND(L^2))
        EXCHANGE(i, j, T)
        if DELTAD(i, j, T, P) > 0
          then EXCHANGE(i, j, T)
    ADDPAIRS(n_n, P)
    DROPPAIRS(n_r, P)
    REMOVEPAIRS(n_n - n_r, P)



Calls to other procedures should be self-explanatory in the context of the already provided
description: RAND($n$) generates a random integer in the interval $[1, n]$; EXCHANGE($i, j, T$)
exchanges the $i$-th and $j$-th elements of $t$; DELTAD($i, j, T, \mathcal{P}$) computes $(\Delta D_T)_{ij}$ according to
Eq. (4); finally, ADDPAIRS($n, \mathcal{P}$), DROPPAIRS($n, \mathcal{P}$) and REMOVEPAIRS($n, \mathcal{P}$) add random,
drop random, and remove the worst performing (highest $d_T$) $n$ pairs respectively, as explained
in subsection III C.

As is often the case with greedy algorithms, we cannot provide guarantees that our 
algorithm will not get stuck in a poor local minimum. In fact, due to the stochasticity of 
the pair set optimization, discussing convergence itself is problematic. Instead, we provide 
convincing empirical evidence of the algorithm's efficacy by demonstrating in the next section 
how it correctly learns a diverse set of transformations. 
IV. RESULTS 

We tested our approach on two image sets: a) synthetic set of triangles, b) set of natural 
image patches. These experiments are described below. 

A. Triangles 

Edges and corners are among the commonest features of natural scene images. Therefore 
a set of random triangles is a good starting point for testing our approach. The set $S$ is drawn
from the set $\mathcal{N}$ of all possible white triangles on black background, whose vertex coordinates
are restricted to integer values in the range $[0, L)$. For convenience, we additionally restricted
$S$ to contain only images with at least 10% of minority pixels. This was done to have better-balanced
search trees, and also to increase the informational content of $S$, since little can be
learned from little-varying images.
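A set of this kind can be generated with a few lines of Python. The sketch below is only one possible construction: the rasterization rule (a pixel is white if its integer coordinates lie inside or on the boundary of the triangle) and the rejection sampling loop are assumptions, and the function names are hypothetical.

```python
import numpy as np

def triangle_image(verts, L):
    """Rasterize a white (1) triangle on a black (0) L x L background.
    verts: three (x, y) vertices with integer coordinates in [0, L)."""
    ys, xs = np.mgrid[0:L, 0:L]
    (x0, y0), (x1, y1), (x2, y2) = verts

    def edge(xa, ya, xb, yb):
        # Signed area test: which side of the edge (a -> b) each pixel lies on.
        return (xb - xa) * (ys - ya) - (yb - ya) * (xs - xa)

    e0 = edge(x0, y0, x1, y1)
    e1 = edge(x1, y1, x2, y2)
    e2 = edge(x2, y2, x0, y0)
    # Accept either vertex orientation.
    inside = ((e0 >= 0) & (e1 >= 0) & (e2 >= 0)) | ((e0 <= 0) & (e1 <= 0) & (e2 <= 0))
    return inside.astype(np.uint8)

def minority_fraction(img):
    """Fraction of pixels that carry the less frequent color."""
    f = float(img.mean())
    return min(f, 1.0 - f)

def sample_triangles(n, L, rng, min_minority=0.10):
    """Draw n random triangle images with at least min_minority minority pixels."""
    out = []
    while len(out) < n:
        verts = rng.integers(0, L, size=(3, 2))
        img = triangle_image(verts, L)
        if minority_fraction(img) >= min_minority:
            out.append(img.ravel())
    return np.array(out)
```

Degenerate triangles are rejected automatically by the minority pixel threshold, since they produce images that are almost entirely one color.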

Our goal was merely to demonstrate that this approach can work, therefore we did not
strive to find the best possible parameters of the algorithm. Some parameters were estimated [23]
and some were found by a bit of trial and error. The parameters we used were: $L = 64$,
$m = 5$, $l = 10$, $|S| = 30000$, $|\mathcal{P}| = 200$, $n_t = 10000$, $n_n = 10$, $n_r = 1$, $n_i = 3000$.

We want to show that the algorithm can learn without supervision multiple distinct transformations
that are representative of $S$. The simplest strategy is to generate transformations
starting from random $T$ and $\mathcal{P}$, eliminating samples with higher $D_T$ to minimize the chance
of including solutions from poor local minima. For more efficiency, compositions of already
learned transformations can be used as initial approximations to $T$. Compositions can also
be chosen to be far from learned samples. We found that for $L = 64$ poor solutions occur
rarely, in less than approximately 10% of cases. By poor we mostly mean a transformation
that appears to have a singularity in its Jacobian matrix. We chose to generate about three
quarters of transformations from nonrandom initial $T$, setting $n_i = 1000$ in such cases. Half
of all generated samples were kept. In this way the algorithm learned about fifty
transformations completely without supervision.

FIG. 1: Visualization of learned transformations for the set of triangles (a,b) and natural image
patches (c). a) Selected examples of affine transformations. b) Set of learned transformations
projected onto various planes in parameter space, top to bottom: $b_x$-$b_y$, $s_x$-$\lambda$, $\lambda$-$\theta$, $\theta$-$s_x$. c) A
typical transformation visualized by a 4 x 4 checkerboard pattern.

All learned transformations looked approximately affine. Selected representative examples
(for better quality additionally iterated with $n_i = 5000$, $|\mathcal{P}| = 300$ and $|S| = 100000$)
are shown in Fig. 1(a). Since the human eye is very good at detecting straight parallel
lines, we deemed it sufficient to judge the quality of the learned affine transformations by visual
inspection of the transforms of appropriate patterns. The transformations are visualized by
applying them to an L x L portrait picture and checkerboard patterns with check sizes 32,
16, 8, 4 and 2. Since the finest checkerboard pattern is clearly discernible, we conclude that
the achieved resolution is no worse than 2 pixels. With our choice of representing transformations
as pixel permutations it is difficult to expect a much better resolution. Other
consequences of this choice are: a) all the captured transformations are volume preserving,
b) there are white-noise areas that correspond to pixels that should be mapped from outside
of the image in a proper affine transformation. Nonetheless, this representation does
capture most of the aspects of affine transformations. To better illustrate this point we plot
in Fig. 1(b) values of various parameters of all the learned examples. The parameters of an
affine transformation $\xi' = A\xi + b$ are computed using:

$$A = C_{\xi'\xi} \left(C_{\xi\xi}\right)^{-1}, \qquad b = \mu_{\xi'} - A \mu_{\xi}, \tag{5}$$
where $C$ and $\mu$ are the covariance matrix and the mean: $C_{ab} = \langle a b^T \rangle - \mu_a \mu_b^T$ and $\mu_a = \langle a \rangle$.
The averaging $\langle \ldots \rangle$ is weighted by a Gaussian with standard deviation $\sigma = 0.1L$ centered at
$(L/2, L/2)$. The weighting is needed because our representation cannot capture an affine
transformation far from the image center. We further parametrize $A$ in terms of a consecutively
applied scaling $S$, transvection $\Lambda$ and rotation $R$. That is, $A = R \Lambda S$, where:

$$S = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}, \quad \Lambda = \begin{pmatrix} 1 & \lambda \\ 0 & 1 \end{pmatrix}, \quad R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}. \tag{6}$$



The parameters $s_x$, $s_y$, $\lambda$ and $\theta$ expressed in terms of $A$ are: $s_x = \sqrt{A_{11}^2 + A_{21}^2}$,
$s_y = \mathrm{Det}(A)/s_x$, $\lambda = (A_{11}A_{12} + A_{21}A_{22})/(s_x s_y)$ and $\theta = \mathrm{atan2}(A_{21}, A_{11})$, where
$\mathrm{Det}(A) = A_{11}A_{22} - A_{21}A_{12}$. From Fig. 1(b) we see that the parameter values are evenly distributed
over certain ranges without obvious correlations. Unexplored regions of the parameter space
correspond to excessive image distortions, with not many images in $S$ connectable by such
transformations at reasonable cost. Also, $|\mathrm{Det}(A)|$ across all transformations was found to
be 0.998 ± 0.004, validating our claim of volume preservation.
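The parameter extraction can be sketched in Python. Two points in this sketch are assumptions rather than statements from the text: the source coordinate $\xi$ is taken to be the position of pixel $t(i)$ and the target coordinate $\xi'$ the position of pixel $i$ (row-major indexing), and the Gaussian weight is evaluated at $\xi$. The function names are hypothetical.

```python
import numpy as np

def fit_affine(t, L, sigma_frac=0.1):
    """Gaussian-weighted least-squares fit of xi' = A xi + b to a pixel permutation t."""
    idx = np.arange(L * L)
    dst = np.stack([idx % L, idx // L], axis=1).astype(float)   # xi': position of pixel i
    src = np.stack([t % L, t // L], axis=1).astype(float)       # xi: position of pixel t(i)
    w = np.exp(-np.sum((src - L / 2) ** 2, axis=1) / (2 * (sigma_frac * L) ** 2))
    w /= w.sum()
    mu_s, mu_d = w @ src, w @ dst                # weighted means
    ds, dd = src - mu_s, dst - mu_d
    C_ds = (dd * w[:, None]).T @ ds              # covariance of xi' with xi
    C_ss = (ds * w[:, None]).T @ ds              # covariance of xi with itself
    A = C_ds @ np.linalg.inv(C_ss)
    b = mu_d - A @ mu_s
    return A, b

def decompose(A):
    """Split A = R Lambda S into scalings s_x, s_y, transvection lam and rotation theta."""
    s_x = np.hypot(A[0, 0], A[1, 0])
    det = A[0, 0] * A[1, 1] - A[1, 0] * A[0, 1]
    s_y = det / s_x
    lam = (A[0, 0] * A[0, 1] + A[1, 0] * A[1, 1]) / (s_x * s_y)
    theta = np.arctan2(A[1, 0], A[0, 0])
    return s_x, s_y, lam, theta
```

For a matrix assembled as $R \Lambda S$ from known parameters, `decompose` returns those parameters, which gives a quick consistency check of the formulas.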



B. Natural image patches 

In the second experiment we learned transformations from a set of natural images, derived
from the Caltech-256 dataset [8]. The original dataset was converted to binary images using
k-means clustering with k = 2. Non-overlapping L x L patches with a minority pixel fraction
of at least 10% were included in $\mathcal{N}$. We had $|\mathcal{N}| \approx 500000$. We used the following algorithm
parameters: $L = 64$, $m = 5$, $l = 10$, $|S| = 200000$, $|\mathcal{P}| = 1000$, $n_t = 10000$, $n_n = 20$, $n_r = 1$,
$n_i = 5000$.
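The preprocessing step can be illustrated as follows. The text does not state whether the clustering was applied to grayscale intensities or to colors, nor whether it was run per image, so the sketch assumes a per-image one-dimensional 2-means on grayscale intensities; the function names are hypothetical.

```python
import numpy as np

def binarize_two_means(gray, n_iter=20):
    """Binarize a grayscale image with 1-D k-means, k = 2 (brighter cluster -> 1)."""
    v = gray.astype(float).ravel()
    lo, hi = v.min(), v.max()
    if lo == hi:
        # Constant image: there is only one cluster.
        return np.zeros(gray.shape, dtype=np.uint8)
    for _ in range(n_iter):
        thr = 0.5 * (lo + hi)            # boundary between the two cluster centers
        bright = v > thr
        lo, hi = v[~bright].mean(), v[bright].mean()
    return (gray > 0.5 * (lo + hi)).astype(np.uint8)

def extract_patches(binary, L, min_minority=0.10):
    """Collect non-overlapping L x L patches with enough minority pixels."""
    H, W = binary.shape
    out = []
    for y in range(0, H - L + 1, L):
        for x in range(0, W - L + 1, L):
            p = binary[y:y + L, x:x + L]
            f = float(p.mean())
            if min(f, 1.0 - f) >= min_minority:
                out.append(p.ravel())
    return out
```

Patches that are almost uniformly black or white are discarded by the minority pixel threshold, mirroring the restriction used for the triangle set.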

Natural images are somewhat richer than the triangle set; consequently, the transformations
we learned were also richer. A typical transformation looked like a general elastic
deformation, often noticeably differing from an affine transformation. White noise areas
were much smaller or absent, while the resolution was lower at about 3 pixels. A typical
example is shown in Fig. 1(c).
V. DISCUSSION AND CONCLUSION 

In this paper we have demonstrated the conceptual feasibility of learning image transformations
from scratch: without image set or pixel set ordering. To the best of our knowledge,
learning transformations from an unordered image dataset has never been considered before.
Our algorithm, when applied to natural images, learns general elastic transformations, of 
which affine transformations are a special case. 

For the sake of simplicity we chose to represent transformations as pixel permutations. 
This choice restricted transformations by enforcing volume conservation. In addition, it 
adversely affected the resolution of transformations. We also limited images to binary form, 
although the learned transformations can be applied to any images. Importantly, we do not 
see any reason why our main idea would not be applicable in the case of a general linear 
transformation acting on continuously-valued pixels. In fact, the softness of continuous
representation may possibly improve convergence properties of the algorithm. We plan 
to explore this extension, expecting it to capture arbitrary scaling transformations and to 
increase the resolution of learned transformations. 

Images that we considered were relatively large by standards of the field. For even larger 
images chances of getting trapped in a poor local minimum increase. To face this challenge 
we can propose a simple modification. Images should be represented by a random subset 
of pixels. Learning should be easy with a small initial size of the subset. In this way one 
learns a transformation at a coarse grained level. Pixels then are gradually added to the 
subset, increasing the transformation resolution, until all pixels are included. Judging from 
our experience, this modification will allow tackling much larger images. 

It seems advantageous for the efficiency of neural processing to factor high dimensional 
transformations, such as affine transformations, into more basic transformations. How the 
learned random transformations can be used to that end is another interesting problem. 

In our view, 3D rotations $R\colon \eta \to \eta'$ can be learned in a similar fashion as we learned affine
transformations, with orientations $\eta$ playing the role of pixels in the current work. The problem
however is much harder since we do not have direct access to the hidden variables $\eta$. Indirect
access is provided through projected transformations $A(R, \eta)$, where the set of $A$ is presumed to
have been learned (apart from its dependence on the arguments $R$ and $\eta$). We believe that
the presence of multiple orientations in a given image and multiple images should constrain
$R$ and $A$ sufficiently for them to be learnable.

To conclude, we consider the presented idea of unsupervised learning of image transformations
novel and valuable, opening new opportunities in learning complex transformations,
possibly tackling such difficult cases as projections of 3D rotations. 